Project: Investigate Indicators from the Gapminder Dataset

Introduction

In this report, I analyze 3 indicators from the Gapminder dataset:

  • CO2 emissions per person
  • Life expectancy
  • Income per person (GDP per capita)

Each dataset tracks one indicator over time, from 1800 to today, for many countries.

In this report, I try to answer the following questions:

  1. How does each indicator evolve over time, and what is its distribution?
  2. Is there any correlation between the indicators?
In [1]:
# Import
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Data Wrangling

General Properties

In this part, for each dataset, I load the data from a CSV file into a Pandas DataFrame, rename the index and the columns axis, and display some information about the DataFrame.

In [2]:
print("CO2 DataFrame\n")
# Load CO2 csv file into a Pandas DataFrame
df_co2 = pd.read_csv('co2_emissions_tonnes_per_person.csv', index_col='country')
# Rename the index and the column
df_co2.columns.name = "Year"
df_co2.index.name = "Country"
# Display info about the dataframe
print(df_co2.info())
df_co2.head()
CO2 DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 192 entries, Afghanistan to Zimbabwe
Columns: 219 entries, 1800 to 2018
dtypes: float64(219)
memory usage: 330.0+ KB
None
Out[2]:
Year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
Country
Afghanistan NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.238 0.29 0.406 0.345 0.28 0.253 0.262 0.245 0.247 0.254
Albania NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.470 1.56 1.790 1.690 1.69 1.900 1.600 1.570 1.610 1.590
Algeria NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 3.400 3.28 3.270 3.430 3.48 3.680 3.800 3.640 3.560 3.690
Andorra NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 6.120 6.12 5.870 5.920 5.90 5.830 5.970 6.070 6.270 6.120
Angola NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.230 1.24 1.250 1.350 1.28 1.640 1.220 1.180 1.140 1.120

5 rows × 219 columns

In [3]:
print("Income DataFrame\n")
# Load Income csv file into a Pandas DataFrame
df_income = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv', index_col='country')
# Rename the index and the column
df_income.columns.name = "Year"
df_income.index.name = "Country"
# Display info about the dataframe
print(df_income.info())
df_income.head()
Income DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 193 entries, Afghanistan to Zimbabwe
Columns: 241 entries, 1800 to 2040
dtypes: int64(241)
memory usage: 364.9+ KB
None
Out[3]:
Year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040
Country
Afghanistan 603 603 603 603 603 603 603 603 603 603 ... 2550 2600 2660 2710 2770 2820 2880 2940 3000 3060
Albania 667 667 667 667 667 668 668 668 668 668 ... 19400 19800 20200 20600 21000 21500 21900 22300 22800 23300
Algeria 715 716 717 718 719 720 721 722 723 724 ... 14300 14600 14900 15200 15500 15800 16100 16500 16800 17100
Andorra 1200 1200 1200 1200 1210 1210 1210 1210 1220 1220 ... 73600 75100 76700 78300 79800 81500 83100 84800 86500 88300
Angola 618 620 623 626 628 631 634 637 640 642 ... 6110 6230 6350 6480 6610 6740 6880 7020 7160 7310

5 rows × 241 columns

In [4]:
print("Life expectancy DataFrame\n")
# Load Life Expectancy csv file into a Pandas DataFrame
df_life_exp = pd.read_csv('life_expectancy_years.csv', index_col='country')
# Rename the index and the column
df_life_exp.columns.name = "Year"
df_life_exp.index.name = "Country"
# Display info about the dataframe
print(df_life_exp.info())
df_life_exp.head()
Life expectancy DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 187 entries, Afghanistan to Zimbabwe
Columns: 301 entries, 1800 to 2100
dtypes: float64(301)
memory usage: 441.2+ KB
None
Out[4]:
Year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
Country
Afghanistan 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 28.1 ... 76.5 76.6 76.7 76.9 77.0 77.1 77.3 77.4 77.5 77.7
Albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 ... 87.4 87.5 87.6 87.7 87.8 87.9 88.0 88.1 88.2 88.3
Algeria 28.8 28.8 28.8 28.8 28.8 28.8 28.8 28.8 28.8 28.8 ... 88.3 88.4 88.5 88.6 88.7 88.8 88.9 89.0 89.1 89.2
Andorra NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Angola 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 ... 78.7 78.9 79.0 79.1 79.3 79.4 79.5 79.7 79.8 79.9

5 rows × 301 columns

From the information displayed above, we can see that:

  • Each dataset has countries as index and years as columns.
  • Each dataset starts in 1800 and seems to end in 2018, but 2 of them (income and life expectancy) contain projected values for future years (up to 2040 and 2100).
  • Many CO2 values are missing in the early years.
  • Life expectancy values are missing for many years for some countries (Andorra, for example).
  • Indicators are stored as float64 or int64.
  • The DataFrames do not all have the same number of countries in their index (192, 193 and 187).
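
To see which countries cause the difference in index size, the indexes can be compared directly. The sketch below is only an illustration on made-up placeholder labels ("A" to "D"), not the Gapminder country lists; the variable names are hypothetical.

```python
import pandas as pd

# Placeholder country labels, invented for illustration only
co2_countries = pd.Index(["A", "B", "C"], name="Country")
income_countries = pd.Index(["A", "B", "C", "D"], name="Country")
life_countries = pd.Index(["B", "C", "D"], name="Country")

# Countries present in all three indexes
common = co2_countries.intersection(income_countries).intersection(life_countries)

# Countries that have an income value but no life expectancy value
missing_from_life = income_countries.difference(life_countries)

print(sorted(common))
print(list(missing_from_life))
```

With the real data, the same calls would apply to df_co2.index, df_income.index and df_life_exp.index.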

Data Cleaning

In this section, I clean the data. To do so, I:

  • Convert the column labels (the years) from string to int so that I can perform operations and comparisons on the year values.
  • Remove the columns for years in the future because they don't help answer my questions.
  • Fill the NaN values in 2 steps:
    • Back fill the NaNs by propagating the next valid value backwards (this fills every NaN except those that run to the end of the row), then
    • forward fill the remaining NaNs by propagating the last valid value forwards.
  • Change the income values from int64 to float64 so that all indicators share the same float64 type.

I put all the cleaning functions in a separate Python file because they are the same for each indicator DataFrame.
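
The contents of that file are not shown in this report. As a rough idea of what the steps listed above could look like, here is a minimal sketch on a toy DataFrame. The function name clean_indicator, the last_year parameter and the toy values are assumptions for illustration, and unlike the calls below, this sketch returns a new DataFrame instead of modifying its argument.

```python
import numpy as np
import pandas as pd

def clean_indicator(df, last_year=2018):
    """Hypothetical helper: apply the cleaning steps and return a new DataFrame."""
    out = df.copy()
    # Year labels are read from the CSV header as strings: convert them to int
    out.columns = out.columns.astype(int)
    # Drop the columns for years after last_year
    out = out.loc[:, out.columns <= last_year]
    # Back fill along each row (country), then forward fill what is left
    out = out.bfill(axis=1).ffill(axis=1)
    # Make sure every value is stored as float64
    return out.astype("float64")

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"2016": [np.nan, 1.0], "2017": [2.0, np.nan], "2018": [3.0, np.nan], "2019": [4.0, 5.0]},
    index=pd.Index(["A", "B"], name="Country"),
)
cleaned = clean_indicator(toy)
print(cleaned)
```

Because the future years are removed before filling, row "B" is completed by the forward fill (from 2016) rather than by the 2019 value.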

In [5]:
import data_cleaning as clean

clean.columns2Int(df_co2)
clean.removeYearsInFuture(df_co2)
clean.fillNaN(df_co2)

print("CO2 cleaned DataFrame\n")
print(df_co2.info())
# any().any() is True if at least one cell in the whole DataFrame is NaN
print("\nIs there some leftover NaN values: {}".format(df_co2.isna().any().any()))
df_co2.head()
CO2 cleaned DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 192 entries, Afghanistan to Zimbabwe
Columns: 219 entries, 1800 to 2018
dtypes: float64(219)
memory usage: 330.0+ KB
None

Is there some leftover NaN values: False
Out[5]:
Year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
Country
Afghanistan 0.00192 0.00192 0.00192 0.00192 0.00192 0.00192 0.00192 0.00192 0.00192 0.00192 ... 0.238 0.29 0.406 0.345 0.28 0.253 0.262 0.245 0.247 0.254
Albania 0.00711 0.00711 0.00711 0.00711 0.00711 0.00711 0.00711 0.00711 0.00711 0.00711 ... 1.470 1.56 1.790 1.690 1.69 1.900 1.600 1.570 1.610 1.590
Algeria 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ... 3.400 3.28 3.270 3.430 3.48 3.680 3.800 3.640 3.560 3.690
Andorra 7.47000 7.47000 7.47000 7.47000 7.47000 7.47000 7.47000 7.47000 7.47000 7.47000 ... 6.120 6.12 5.870 5.920 5.90 5.830 5.970 6.070 6.270 6.120
Angola 0.04110 0.04110 0.04110 0.04110 0.04110 0.04110 0.04110 0.04110 0.04110 0.04110 ... 1.230 1.24 1.250 1.350 1.28 1.640 1.220 1.180 1.140 1.120

5 rows × 219 columns

In [6]:
clean.columns2Int(df_income)
clean.removeYearsInFuture(df_income)
clean.fillNaN(df_income)
df_income = clean.values2Float(df_income)

print("Income cleaned DataFrame\n")
print(df_income.info())
# any().any() is True if at least one cell in the whole DataFrame is NaN
print("\nIs there some leftover NaN values: {}".format(df_income.isna().any().any()))
df_income.head()
Income cleaned DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 193 entries, Afghanistan to Zimbabwe
Columns: 219 entries, 1800 to 2018
dtypes: float64(219)
memory usage: 331.7+ KB
None

Is there some leftover NaN values: False
Out[6]:
Year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
Country
Afghanistan 603.0 603.0 603.0 603.0 603.0 603.0 603.0 603.0 603.0 603.0 ... 1500.0 1670.0 1630.0 1770.0 1810.0 1800.0 1770.0 1760.0 1760.0 1740.0
Albania 667.0 667.0 667.0 667.0 667.0 668.0 668.0 668.0 668.0 668.0 ... 9520.0 9930.0 10200.0 10400.0 10500.0 10700.0 11000.0 11400.0 11800.0 12300.0
Algeria 715.0 716.0 717.0 718.0 719.0 720.0 721.0 722.0 723.0 724.0 ... 12700.0 12900.0 13000.0 13200.0 13300.0 13500.0 13800.0 13900.0 13900.0 13900.0
Andorra 1200.0 1200.0 1200.0 1200.0 1210.0 1210.0 1210.0 1210.0 1220.0 1220.0 ... 41700.0 39000.0 42000.0 41900.0 43700.0 44900.0 46600.0 48200.0 49800.0 51500.0
Angola 618.0 620.0 623.0 626.0 628.0 631.0 634.0 637.0 640.0 642.0 ... 6290.0 6360.0 6350.0 6640.0 6730.0 6810.0 6640.0 6260.0 6040.0 5720.0

5 rows × 219 columns

In [7]:
clean.columns2Int(df_life_exp)
clean.removeYearsInFuture(df_life_exp)
clean.fillNaN(df_life_exp)

print("Life expectancy cleaned DataFrame\n")
print(df_life_exp.info())
# any().any() is True if at least one cell in the whole DataFrame is NaN
print("\nIs there some leftover NaN values: {}".format(df_life_exp.isna().any().any()))
df_life_exp.head()
Life expectancy cleaned DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 187 entries, Afghanistan to Zimbabwe
Columns: 219 entries, 1800 to 2018
dtypes: float64(219)
memory usage: 321.4+ KB
None

Is there some leftover NaN values: False
Out[7]:
Year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
Country
Afghanistan 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 28.1 ... 59.3 59.9 60.4 60.8 61.3 61.2 61.2 61.2 63.4 63.7
Albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 ... 77.5 77.6 77.7 77.8 77.9 77.9 78.0 78.1 78.2 78.3
Algeria 28.8 28.8 28.8 28.8 28.8 28.8 28.8 28.8 28.8 28.8 ... 76.1 76.3 76.5 76.8 76.9 77.0 77.1 77.4 77.7 77.9
Andorra 75.5 75.5 75.5 75.5 75.5 75.5 75.5 75.5 75.5 75.5 ... 82.2 82.3 82.4 82.4 82.5 82.5 82.6 82.7 82.7 82.7
Angola 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 ... 59.1 59.9 60.6 61.3 61.9 62.8 63.3 63.8 64.2 64.6

5 rows × 219 columns

We can now see that:

  • All indicator values are float64.
  • Each indicator covers 1800 to 2018.
  • There are no more NaN values in the DataFrames.
  • The list of countries (index) is not the same in each DataFrame, but this does not matter for the first question. I deal with this issue later in the report.

These 3 DataFrames are enough to explore and answer question 1. However, question 2 requires some more cleaning and merging. To explore potential correlations between the indicators, we need to combine them. To do so, I first transform each DataFrame into a multi-indexed Series with stack(), and then concatenate the 3 Series. The result is a multi-indexed DataFrame with (Country, Year) as index and 3 columns (CO2, Income and LifeExp).
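
To make the stack-then-concatenate step concrete, here is a small sketch on two toy DataFrames. The labels and values are invented for illustration; only the pandas calls mirror what the next cell does.

```python
import pandas as pd

# Two toy indicator tables (countries as index, years as columns), values invented
a = pd.DataFrame({2017: [1.0, 2.0], 2018: [1.5, 2.5]},
                 index=pd.Index(["X", "Y"], name="Country"))
a.columns.name = "Year"
b = pd.DataFrame({2017: [10.0], 2018: [11.0]},
                 index=pd.Index(["X"], name="Country"))
b.columns.name = "Year"

# stack() turns each wide table into a Series indexed by (Country, Year);
# concat(axis=1) aligns the Series on the union of their indexes
combined = pd.concat([a.stack(), b.stack()], axis=1, keys=["A", "B"])
print(combined)

# Country "Y" is absent from b, so its rows hold NaN in column "B"
print(combined.dropna())
```

The union of indexes is why countries missing from one indicator show up as NaN rows after the concatenation.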

In [8]:
print("Multi Index DataFrame\n")
df_multi = pd.concat([df_co2.stack(), df_income.stack(), df_life_exp.stack()], axis=1)
df_multi.columns.name = "Indices"
df_multi.columns = ["CO2", "Income", "LifeExp"]
print(df_multi.info())
df_multi
Multi Index DataFrame

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 42486 entries, ('Afghanistan', 1800) to ('Zimbabwe', 2018)
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CO2      42048 non-null  float64
 1   Income   42267 non-null  float64
 2   LifeExp  40953 non-null  float64
dtypes: float64(3)
memory usage: 1.1+ MB
None
Out[8]:
CO2 Income LifeExp
Country Year
Afghanistan 1800 0.00192 603.0 28.2
1801 0.00192 603.0 28.2
1802 0.00192 603.0 28.2
1803 0.00192 603.0 28.2
1804 0.00192 603.0 28.2
... ... ... ... ...
Zimbabwe 2014 0.88100 2510.0 58.5
2015 0.88100 2510.0 59.6
2016 0.77100 2490.0 60.5
2017 0.84500 2570.0 61.4
2018 0.85000 2620.0 61.7

42486 rows × 3 columns

As mentioned above, the 3 initial DataFrames don't have the same list of countries as index. As a result, the final multi-indexed DataFrame contains NaN values for every country that is missing from at least one indicator. I simply drop the rows that contain NaN.

In [9]:
print("Multi Index cleaned DataFrame\n")
df_multi.dropna(inplace=True)
# any().any() is True if at least one cell in the whole DataFrame is NaN
print("\nIs there some leftover NaN values: {}".format(df_multi.isna().any().any()))
df_multi.info()
Multi Index cleaned DataFrame


Is there some leftover NaN values: False
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 40953 entries, ('Afghanistan', 1800) to ('Zimbabwe', 2018)
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CO2      40953 non-null  float64
 1   Income   40953 non-null  float64
 2   LifeExp  40953 non-null  float64
dtypes: float64(3)
memory usage: 1.1+ MB

Now that I have all the cleaned DataFrames needed to answer my 2 questions, I can move on to the exploratory phase.

Exploratory Data Analysis

Research Question 1

In this section I will explore the data to analyze the distribution and evolution in time of the 3 indicators independently from one another. For each indicator I will explore:

  • The distribution of the indicators for a given set of years using some histograms and some key values (avg, std, ...) for the same year.
  • The evolution in time of the indicators for a given set of countries and some key values (avg, std, ...) for the same countries.
In [10]:
# Global variable used for the data analysis for each DataFrame
years_to_plot = [1800, 1850, 1900, 1950, 2000, 2018]
countries_to_plot = ['Australia', 'Brazil', 'Burkina Faso', 'China', 'France', 'Iraq', 'United States']

CO2 Emission exploration

In [11]:
df_co2[years_to_plot].plot(kind='hist',subplots=True, layout=(3,2), title='Distribution of CO2 emissions per person (in tonnes) around the world for some given years', figsize=(15,10), grid=True, sharey=True, sharex=False);
In [12]:
df_co2[years_to_plot].describe()
Out[12]:
Year 1800 1850 1900 1950 2000 2018
count 192.000000 192.000000 192.000000 192.000000 192.000000 192.000000
mean 0.524797 0.550436 0.910708 1.761765 4.524371 4.455041
std 1.909994 1.925194 2.252759 3.286148 6.760072 5.609198
min 0.000000 0.000000 0.000000 0.000000 0.019000 0.024300
25% 0.004585 0.004585 0.020825 0.073225 0.492750 0.669250
50% 0.040950 0.042650 0.106000 0.385000 1.990000 2.530000
75% 0.177500 0.209750 0.604250 1.665000 6.457500 5.925000
max 17.300000 17.300000 17.300000 25.100000 58.400000 38.000000
In [13]:
import data_exploration as explore

explore.plotLine(df_co2, countries_to_plot, title="Evolution of CO2 emissions per person", ylabel="CO2 emissions (Tonnes)")
In [14]:
df_co2.loc[countries_to_plot].transpose().describe()
Out[14]:
Country Australia Brazil Burkina Faso China France Iraq United States
count 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000
mean 5.528219 0.517873 0.020800 0.782234 3.561421 1.022637 9.779812
std 6.426208 0.665737 0.041616 1.637011 2.843197 1.521417 7.865244
min 0.000000 0.087800 0.000780 0.000230 0.067300 0.000000 0.042200
25% 0.000000 0.114000 0.000780 0.000240 0.848500 0.000000 1.310000
50% 3.420000 0.126000 0.000780 0.035900 3.030000 0.000000 11.200000
75% 9.580000 0.701000 0.018800 0.679500 5.840000 2.290000 16.450000
max 19.300000 2.590000 0.198000 7.060000 9.960000 5.320000 22.200000

CO2 evolution Analysis

  • From the distribution histograms for the selected years, we can see that CO2 emissions are right-skewed in every year shown, although the skew weakens over time. This means that most countries emit little, while a small group of countries emits far more.
  • From the table of summary statistics for the selected years, we can see that the mean and the standard deviation of the emissions increase until 2000 and then level off (the mean is 0.52 tonnes in 1800, 4.52 in 2000 and 4.46 in 2018). This is consistent with the right skew weakening over time. It also tells us that, globally, CO2 emissions increased until around 2000.
  • The line plots for the selected countries confirm that emissions increased over time for most countries. They also show the gap between countries, which is consistent with the skew seen in the histograms.
  • The table of summary statistics for the selected countries also confirms a huge difference between countries that emit very little (Burkina Faso) and the largest emitters (United States).
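
The skew discussed above is read off the histograms by eye. One way to put a number on it is the sample skewness returned by DataFrame.skew(). The sketch below uses synthetic samples (not the Gapminder data) only to show how the value behaves: it is clearly positive for a right-skewed sample and close to 0 for a symmetric one.

```python
import numpy as np
import pandas as pd

# Synthetic samples, generated for illustration only
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})

# One skewness value per column: positive means a longer tail on the right
skewness = toy.skew()
print(skewness)

# In a right-skewed sample the mean is pulled above the median
print(toy["right_skewed"].mean() > toy["right_skewed"].median())
```

On the real data, df_co2[years_to_plot].skew() would give one value per selected year, which could be compared across years.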

Income exploration

In [15]:
df_income[years_to_plot].hist(figsize=(20,20));
In [16]:
#df_income[years_to_plot].plot(kind='hist',subplots=True, layout=(3,2), title='Distribution of Income per person (GDP per capita Inflation adjusted) around the world for some given years', figsize=(15,10), grid=True, sharey=True, sharex=False);
In [17]:
df_income[years_to_plot].describe()
Out[17]:
Year 1800 1850 1900 1950 2000 2018
count 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000
mean 978.419689 1202.088083 1897.823834 4019.326425 14235.901554 18462.658031
std 579.466695 805.683249 1703.220938 5378.751466 18512.580390 19615.694741
min 250.000000 246.000000 318.000000 345.000000 573.000000 629.000000
25% 592.000000 664.000000 845.000000 1350.000000 2710.000000 3740.000000
50% 817.000000 933.000000 1250.000000 2580.000000 7220.000000 12100.000000
75% 1160.000000 1450.000000 2250.000000 4500.000000 16300.000000 27100.000000
max 3840.000000 5210.000000 13800.000000 59800.000000 108000.000000 113000.000000
In [18]:
explore.plotLine(df_income, countries_to_plot, title="Evolution of Income per person", ylabel="Income (GDP per capita Inflation adjusted)")
In [19]:
df_income.loc[countries_to_plot].transpose().describe()
Out[19]:
Country Australia Brazil Burkina Faso China France Iraq United States
count 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000
mean 12672.045662 3847.762557 643.114155 1593.757991 10539.315068 8759.086758 14263.378995
std 11862.548217 4233.670412 271.616957 2626.663611 11470.583197 6832.529458 15163.910727
min 817.000000 1060.000000 480.000000 560.000000 1850.000000 3840.000000 1960.000000
25% 4840.000000 1120.000000 499.500000 732.500000 2725.000000 4080.000000 3245.000000
50% 8480.000000 1430.000000 525.000000 772.000000 4760.000000 5380.000000 7430.000000
75% 17300.000000 4710.000000 688.500000 850.500000 14150.000000 10950.000000 19900.000000
max 45400.000000 15500.000000 1760.000000 16200.000000 39600.000000 37400.000000 55700.000000

Income analysis

  • From the distribution histograms for the selected years, we can see that income is right-skewed in every year shown. The skew increases until ~1950 and has been decreasing since then. This means that there is a more or less large gap in income between most countries and a small group of countries with a much higher income.
  • From the table of summary statistics for the selected years, we can see that the mean and the standard deviation of income keep increasing up to today. This tells us that, globally, income has been increasing throughout the period.
  • The line plots for the selected countries confirm that income increased over time for most countries. They also show the income gap between countries, which is consistent with the skew seen in the histograms.
  • The table of summary statistics for the selected countries also confirms a huge difference in income between poor countries (Burkina Faso) and rich ones (United States).

Life expectancy exploration

In [20]:
df_life_exp[years_to_plot].plot(kind='hist',subplots=True, layout=(3,2), title='Distribution of the life expectancy around the world for some given years', figsize=(15,10), grid=True, sharey=True, sharex=True);
In [21]:
df_life_exp[years_to_plot].describe()
Out[21]:
Year 1800 1850 1900 1950 2000 2018
count 187.000000 187.000000 187.000000 187.000000 187.000000 187.000000
mean 32.080749 32.241711 34.155080 50.362567 67.486096 72.974866
std 5.971964 6.550874 7.650435 11.281457 9.517975 6.975661
min 23.400000 14.000000 18.400000 23.800000 44.300000 52.400000
25% 29.150000 29.200000 30.250000 40.950000 60.550000 68.100000
50% 31.800000 31.900000 32.700000 50.400000 71.100000 74.100000
75% 34.000000 34.100000 35.700000 59.150000 74.500000 78.350000
max 75.500000 75.500000 75.500000 75.500000 81.400000 85.000000
In [22]:
explore.plotLine(df_life_exp, countries_to_plot, title="Evolution of Life expectancy over time", ylabel="Age in Years")
In [23]:
df_life_exp.loc[countries_to_plot].transpose().describe()
Out[23]:
Country Australia Brazil Burkina Faso China France Iraq United States
count 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000 219.000000
mean 54.372603 43.134703 35.704566 41.987215 54.005023 40.792237 54.883562
std 18.052085 15.750797 9.668922 16.349167 16.174065 14.877305 15.233136
min 34.000000 27.400000 12.000000 22.400000 29.400000 26.300000 31.000000
25% 34.000000 32.000000 29.200000 32.000000 40.350000 31.200000 39.400000
50% 54.800000 32.300000 30.700000 32.000000 48.000000 31.600000 51.800000
75% 71.100000 57.800000 40.800000 52.500000 71.250000 52.750000 70.400000
max 82.600000 75.700000 62.100000 77.300000 83.000000 76.900000 78.900000

Life expectancy Analysis

  • From the distribution histograms for the selected years, we can see that life expectancy was roughly normally distributed until ~1950. It was slightly right-skewed before 1950 and has been more left-skewed since then. The skew increases until ~1950 and has been decreasing since. We can also see that the distribution keeps moving to the right, meaning that life expectancy has been increasing over time.
  • From the table of summary statistics for the selected years, we can see that the mean life expectancy keeps increasing up to today. This confirms that, globally, life expectancy has been increasing. The standard deviation increased until ~1950 and has been decreasing since then.
  • The line plots for the selected countries confirm that life expectancy has been increasing over time for most countries. The trend looks the same for most of them: life expectancy stays stable until a given year and then increases until now. What changes from country to country is the year of this turning point, which comes earlier or later depending on the country (United States ~1850 vs Burkina Faso ~1950).
  • From the table of summary statistics for the selected countries, we can confirm that the average life expectancy over the whole period is lower for countries where the increase started later (Burkina Faso) than for countries where it started earlier (United States).

Research Question 2

In this section, I explore the multi-indexed DataFrame containing the 3 indicators to answer the second question of my report. To do so, I:

  • Display a scatter plot of each indicator against each of the others to look for correlations between the indicators.
  • For a set of given countries, display on the same plot the evolution of the 3 indicators over time to look for similarities in the trends.

Scatter plots

In [24]:
df_multi.plot(kind='scatter', x='CO2', y='Income', figsize=(8, 8));
In [25]:
df_multi.plot(kind='scatter', x='CO2', y='LifeExp', figsize=(8, 8));
In [26]:
df_multi.plot(kind='scatter', x='LifeExp', y='Income', figsize=(8, 8));

From these 3 plots, we can see a correlation between CO2 emissions and Income, and between Income and Life Expectancy, but no clear correlation between CO2 emissions and Life Expectancy.
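
The reading above is visual. It can be complemented with correlation coefficients. The sketch below runs on a synthetic DataFrame (values generated for illustration, not the Gapminder data) and shows the Pearson coefficient together with a rank-based (Spearman-style) coefficient, which is less sensitive to non-linear but monotonic relationships.

```python
import numpy as np
import pandas as pd

# Synthetic data, generated for illustration only
rng = np.random.default_rng(0)
income = rng.uniform(500, 50000, size=200)
toy = pd.DataFrame({
    "Income": income,
    "CO2": income / 5000 + rng.normal(0, 1, size=200),
    "LifeExp": 30 + 10 * np.log10(income) + rng.normal(0, 2, size=200),
})

# Pearson correlation: strength of the linear relationship between columns
pearson = toy.corr()

# Rank correlation: Pearson correlation computed on the ranks of the values
rank_corr = toy.rank().corr()

print(pearson.round(2))
print(rank_corr.round(2))
```

With the real data, the same calls would apply to df_multi[['CO2', 'Income', 'LifeExp']].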

Evolution in time of the 3 indicators

To plot the 3 signals on the same axes and see any trend in their evolution, we first need to rescale each signal by dividing it by its maximum value for the country concerned. This puts the 3 signals on the same scale (from 0.0 to 1.0) instead of very different scales (tens of units for LifeExp and CO2, versus ~100000 for Income).
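
As a small worked example of this rescaling, the sketch below divides a toy Income column by its per-country maximum. The labels and values are invented for illustration; groupby(...).transform('max') is one way to repeat each country's maximum on every row of that country.

```python
import pandas as pd

# Toy (Country, Year) data, invented for illustration only
index = pd.MultiIndex.from_product(
    [["A", "B"], [2016, 2017, 2018]], names=["Country", "Year"]
)
toy = pd.DataFrame({"Income": [100.0, 150.0, 200.0, 10.0, 40.0, 20.0]}, index=index)

# Per-country maximum, broadcast back to every (Country, Year) row
country_max = toy.groupby(level="Country")["Income"].transform("max")

# Each country's series now peaks at exactly 1.0
toy["Income_ratio"] = toy["Income"] / country_max
print(toy)
```

Country "A" becomes 0.5, 0.75, 1.0 and country "B" becomes 0.25, 1.0, 0.5, so both can share one axis even though their raw values differ by an order of magnitude.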

In [27]:
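# Divide each indicator by its per-country maximum.
# groupby/transform replaces Series.max(level=...), which was removed in pandas 2.0.
df_multi['CO2_ratio'] = df_multi['CO2'] / df_multi.groupby(level='Country')['CO2'].transform('max')
df_multi['Income_ratio'] = df_multi['Income'] / df_multi.groupby(level='Country')['Income'].transform('max')
df_multi['LifeExp_ratio'] = df_multi['LifeExp'] / df_multi.groupby(level='Country')['LifeExp'].transform('max')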
df_multi
Out[27]:
CO2 Income LifeExp CO2_ratio Income_ratio LifeExp_ratio
Country Year
Afghanistan 1800 0.00192 603.0 28.2 0.004729 0.220073 0.442700
1801 0.00192 603.0 28.2 0.004729 0.220073 0.442700
1802 0.00192 603.0 28.2 0.004729 0.220073 0.442700
1803 0.00192 603.0 28.2 0.004729 0.220073 0.442700
1804 0.00192 603.0 28.2 0.004729 0.220073 0.442700
... ... ... ... ... ... ... ...
Zimbabwe 2014 0.88100 2510.0 58.5 0.444949 0.809677 0.931529
2015 0.88100 2510.0 59.6 0.444949 0.809677 0.949045
2016 0.77100 2490.0 60.5 0.389394 0.803226 0.963376
2017 0.84500 2570.0 61.4 0.426768 0.829032 0.977707
2018 0.85000 2620.0 61.7 0.429293 0.845161 0.982484

40953 rows × 6 columns

In [28]:
for country in countries_to_plot:
    df_multi.loc[country].plot(y=['CO2_ratio', 'Income_ratio', 'LifeExp_ratio'], title="Evolution of the ratio of the indices (% of max) in {}".format(country), figsize=(8,8))

From these plots, we can observe that, within each country, the 3 indicators tend to start increasing at around the same time. However, the year in which the increase starts differs from country to country.

Conclusions

From this analysis, we can conclude that CO2 emissions, life expectancy and income all increased globally over the last 200 years. We also observed that there has always been a huge disparity in CO2 emissions and income between the vast majority of countries, which sit at the lower end of those indicators, and a few countries with much higher values. All 3 indicators seem to follow the same evolution over time: for a given country, they all seem to start increasing at around the same time. Finally, we observed a correlation between CO2 emissions and income, and between income and life expectancy.